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> 

1/^ ■ K-means is undoubtedly the most widely used partitional clustering algorithm. Unfortu- 

V^ ' nately, due to its gradient descent nature, this algorithm is highly sensitive to the initial 

^+ ' placement of the cluster centers. Numerous initialization methods have been proposed 

^^ , to address this problem. Many of these methods, however, have superlinear complex- 

ity in the number of data points, making them impractical for large data sets. On the 
other hand, linear methods are often random and/or order-sensitive, which renders their 
results unrepeatable. Recently, Su and Dy proposed two highly successful hierarchical 
initialization methods named Var-Part and PCA-Part that are not only linear, but also 
deterministic (non-random) and order-invariant. In this paper, we propose a discrimi- 
nant analysis based approach that addresses a common deficiency of these two methods. 
Experiments on a large and diverse collection of data sets from the UCI Machine Learn- 
J^ , ing Repository demonstrate that Var-Part and PCA-Part are highly competitive with 

$H ' one of the best random initialization methods to date, i.e., k-meansH — h, and that the 

^ ' proposed approach significantly improves the performance of both hierarchical methods. 

Keywords: Partitional clustering; sum of squared error criterion; k-means; cluster center 
initialization; thresholding. 

1. Introduction 

Clustering, the unsupervised classification of patterns into groups, is one of the 
most important tasks in exploratory data analysis ^. Primary goals of clustering 
include gaining insight into data (detecting anomalies, identifying salient features, 
etc.), classifying data, and compressing data. Clustering has a long and rich his- 
tory in a variety of scientific disciplines including anthropology, biology, medicine, 
psychology, statistics, mathematics, engineering, and computer science. As a result, 
numerous clustering algorithms have been proposed since the early 1950s l^. 

Clustering algorithms can be broadly classified into two groups: hierarchical 
and partitional l^. Hierarchical algorithms recursively find nested clusters either in 
a top-down (divisive) or bottom- up (agglomerative) fashion. In contrast, partitional 
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algorithms find all the clusters simultaneously as a partition of the data and do not 
impose a hierarchical structure. Most hierarchical algorithms have quadratic or 
higher complexity in the number of data points ^ and therefore are not suitable for 
large data sets, whereas partitional algorithms often have lower complexity. 

Given a data set X = {xi, X2, . . . , x^r} in M^, i.e., N points (vectors) each with 
D attributes (components), hard partitional algorithms divide X into K exhaustive 
and mutually exclusive clusters V = {Pi,P2, ■ ■ ■ , Pr}, Ui=i ^i ^ -^ ^ Pi^ Pj = ^ 
for 1 < i ^ j < K. These algorithms usually generate clusters by optimizing a 
criterion function. The most intuitive and frequently used criterion function is the 
Sum of Squared Error (SSE) given by: 

K 

ssE = E E K-^'^ii2 (1) 

where ||.||2 denotes the Euclidean (£2) norm and c^ = l/l^ilSx eP ^i i^ ^^^ 
centroid of cluster Pi whose cardinality is \Pi\. The optimization of (tlj is often 
referred to as the minimum SSE clustering (MSSC) problem. 

The number of ways in which a set of A^ objects can be partitioned into K 
non-empty groups is given by Stirling numbers of the second kind: 

5(7V,/0 = -^f](-l)^-M ^'V^ (2) 

which can be approximated by K^ / K\ It can be seen that a complete enumeration 
of all possible clusterings to determine the global minimum of ([T]) is clearly com- 
putationally prohibitive except for very small data setsl^. In fact, this non-convex 
optimization problem is proven to be NP-hard even for iiT = 2 1^ or £> = 2 1^. Con- 
sequently, various heuristics have been developed to provide approximate solutions 
to this problem!^. Among these heuristics, Lloyd's algorithmic, often referred to as 
the (batch) k-means algorithm, is the simplest and most commonly used one. This 
algorithm starts with K arbitrary centers, typically chosen uniformly at random 
from the data points. Each point is assigned to the nearest center and then each 
center is recalculated as the mean of all points assigned to it. These two steps are 
repeated until a predefined termination criterion is met. 

The k-means algorithm is undoubtedly the most widely used partitional clus- 
tering algorithm'^. Its popularity can be attributed to several reasons. First, it is 
conceptually simple and easy to implement. Virtually every data mining software 
includes an implementation of it. Second, it is versatile, i.e., almost every aspect 
of the algorithm (initialization, distance function, termination criterion, etc.) can 
be modified. This is evidenced by hundreds of publications over the last fifty years 
that extend k-means in various ways. Third, it has a time complexity that is linear 
in N , D, and K (in general, D <^ N and K <^ N). For this reason, it can be used 
to initialize more expensive clustering algo rithms such as expectation maximi zation 
1^, DBSCANI^, and spectral clustering L!i!l. Furthermore, numerous sequential^ ^ ^ 
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and parallel ^^ acceleration techniques are available in the literature. Fourth, it 
has a storage complexity that is linear in N , D, and K. In addition, t her e exist 
disk-based variants that do n ot r equire all points to be s tored in memory L!^. Fifth, 
it is guaranteed to converge ^^ at a quadratic rate ^^. Finally, it is invariant to 
data ordering, i.e., random shufflings of the data points. 

On the other hand, k-means has several significant disadvantages. First, it re- 
quires the number of clusters, /•sT, to be specified in advance. The value of this 
parameter can be determ ined automatically by means of various internal/relative 
cluster validity measures ^^^. Second, it can only detect compact, hypcrspherical 
clusters that are well separated. This can be alleviated by using a more general 
distance function such as the Mahalanobis distance, which permits the detection 
of hyperellipsoidal clusters ^^. Third, due its utilization of the squared Euclidean 
distance, it is sensitive to noise and outlier points since even a few such points can 
significantly influence the means of their respective clusters. This can be addressed 
by outlier pruning L!^ or using a more robust distance function such as City-block 
(£i) distance. Fourth, due to its gradient descent nature, it often converges to a lo- 
cal minimum of the criterion function^^. For the same reason, it is highly sensitive 
to the selection of the initial centers l^. Adverse effects of improper initialization 
include empty clusters, slower convergence, and a higher chance of getting stuck in 
bad local minimal^. Fortunately, except for the first two, these drawbacks can be 
remedied by using an adaptive initialization method (IM). ^^^^^^^^ 

A large number of IMs have been proposed in the literature '^ ". Unfortu- 
nately, many of these have superlinear complexity in N^^ '^^ ^^ '^ '^'^ "^^ ^^ •^'-' '^^ ^^1^ 
which makes them impractical for large data sets (note that k-means itself has lin- 
ear complexity). In c ontrast, linear IMs are often random and/or order-sensitive 
| 33 | 34 | 35 | 36 | 37 | 38 | 8 | 39 | .^^j^i^h renders their results unrepeatable. Su and Dy proposed 
two divisive hierarchical initialization methods named Var-Par t an d PCA-Part that 
are not only linear, but also deterministic and order- invariant 1^^. In this study, we 
propose a simple modification to these methods that improves their performance 
significantly. 

The rest of the paper is organized as follows. Section [2] presents a brief overview 
of some of the most popular linear, order-invariant k-means IMs and the proposed 
modification to Var-Part and PCA-Part. Section [3] presents the experimental re- 
sults, while Section S] analyzes these results. Finally, Section [5] gives the conclusions. 

2. Linear, Order-Invariant Initialization Methods for K-Means 

2.1. Overview of the Existing Methods 

Forgy's methodic assigns each point to one of the K clusters uniformly at random. 
The centers are then given by the centroids of these initial clusters. This method 
has no theoretical basis, as such random clusters have no internal homogeneity'^. 
MacQueen l^ proposed two different methods. The first one, which is the de- 
fault option in the Quick Cluster procedure of IBM SPSS Statistics '^, takes the 
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first K points in X as the centers. An obvious drawback of this method is its sensi- 
tivity to data ordering. The second method chooses the centers randomly from the 
data points. The rationale behind this method is that random selection is likely to 
pick points from dense regions, i.e., points that are good candidates to be centers. 
However, there is no m echanism to avoid choosing outliers or points that are too 
close to each other 1^. Multiple runs of this method is the standard way of initial- 
izing k-means '^. It should be noted that this second method is often mistakenly 
attributed to Forgyl^. 

The maximin method 1^ chooses the first center Ci arbitrarily and the i-th {i G 
{2,3,..., K}) center c,; is chosen to be the point that has the greatest minimum- 
distance to the previously selected centers, i.e., Ci,C2, . . . ,Ci_i. This method was 
originally developed as a 2-a ppr oximation to the X-center clustering problenu. 

The k-means-|--|- method '^^ interpolates between MacQueen's second method 
and the maximin method. It chooses the first center randomly and the i-th 

{i £ {2,3, ... , K}) center is chosen to be x' G <¥ with a probability of ^^ — 



where md{x.) denotes the minimum-distance from a point x to the previously se- 
lected centers. This method yields an 9(log A') approximation to the MSSC prob- 
lem. 

The PCA-Part methodic uses a divisive hierarchical approach based on PCA 
(Principal Component Analysis) '^^. Starting from an initial cluster that contains 
the entire data set, the method iteratively selects the cluster with the greatest 
SSE and divides it into two subclusters using a hyperplane that passes through 
the cluster centroid and is orthogonal to the principal eigenvector of the cluster 
covariance matrix. This procedure is repeated until K clusters are obtained. The 
centers are then given by the centroids of these clusters. The Var-Part method '^^ 
is an approximation to PCA-Part, where the covariance matrix of the cluster to be 
split is assumed to be diagonal. In this case, the splitting hyperplane is orthogonal 
to the coordinate axis with the greatest variance. 

FigureJT] illustrates the Var-Part procedure on a toy data set with four natural 
clusters'^. In iteration 1, the initial cluster that contains the entire data set is split 
into two subclusters along the Y axis using a line (one-dimensional hyperplane) that 
passes through the mean point (92.026667). Between the resulting two clusters, the 
one above the line has a greater SSE. In iteration 2, this cluster is therefore split 
along the X axis at the mean point (66.975000). In the final iteration, the cluster 
with the greatest SSE, i.e., the bottom cluster, is split along the X axis at the 



mean point (41.057143). In Figure 1(d) the centroids of the final four clusters are 
denoted by stars. 



^Given a set of A'' points in a metric space, the goal of ii'-center clustering is to find K represen- 
tative points (centers) such that the maximum distance of a point to a center is minimized. Given 
a minimization problem, a 2- approximation algorithm is one that finds a solution whose cost is 
at most twice the cost of the optimal solution. 
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Fig. 1. Illustration of Var-Part on the Ruspini data set 



2.2. Proposed Modification to Var-Part and PCA-Part 

Su and Dy ^^ demonstrated that, besides being computationally efficient, Var- 
Part and PCA-Part perform very well on a variety of data sets. Recall that in each 
iteration these methods select the cluster with the greatest SSE and then project the 
i?-diniensional points in this cluster on a partitioning axis. The difference between 
the two methods is the choice of this axis. In Var-Part, the partitioning axis is the 
coordinate axis with the greatest variance, whereas in PCA-Part it is the major 
axis. After the projection operation, both methods use the mean point on the 
partitioning axis as a 'threshold' to divide the points between two clusters. In other 
words, each point is assigned to one of the two subclusters depending on which side 
of the mean point its projection falls to. It should be noted that the choice of this 
threshold is primarily motivated by computational convenience. Here, we propose 
a better alternative based on discriminant analysis. 

The projections of the points on the partitioning axis can be viewed as a discrete 
probability distribution, which can be conveniently represented by a histogram. 
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The problem of dividing a histogram into two partitions is a weU studied one in the 
field of image processing. A plethora of histogram partitioning, a.k.a. thresholding, 
methods has been proposed in the literature with the early ones dating back to 
the 1960s 1^. Among these, Otsu's methodic has bec ome the method of choice as 
confirmed by numerous comparative studies 1^ ^ '-' ^ 4 d | o1 | 0/! | 

Given an image represented by L gray levels {0, 1, . . . , L — 1}, a thresholding 
method partitions the image pixels into two classes Cg = {0,1,..., t} and Ci = 
{t + l,t + 2, . . . ,L — 1} (object and background, or vice versa) at gray level t. In 
other words, pixels with gray levels less than or equal to the threshold t are assigned 
to Co, whereas the remaining pixels are assigned to Ci. 

Let Ui be the number of pixels with gray level i. The total number of pixels in 
the image is then given by n = X]i=o "*■ ^^'^ normalized gray level histogram of 
the image can be regarded as a probability mass function: 

L-l 

Pi = —, Pi>0, ^^1 = 1 
n ^-^ 

4=0 

Let po{t) — X]i=oP* ^^"^ Pi(i) = 1 ^ Pait) denote the probabilities of Co and Ci, 
respectively. The means of the respective classes are then given by: 

/^o(i) = Kt)/Po{t) 

Mt) = {f^T~m)/pi{t) 

where fi{t) = X]i=o *P* ^^'^ I^t = fJ-iL — 1) denote the first moment of the histogram 
up to gray level t and mean gray level of the image, respectively. 

Otsu's method adopts between-class 

variance, i.e., cr'^(t) ~ pQ(t)pi{t) [^Q{t) ~ ^i{t)] , from the discriminant analysis 
literature as its objective function and determines the optimal threshold t* as the 

gray level that maximizes cr^(t), i.e., t* = argmax ag{t). Between-class vari- 

te{o,i,...,L-i} 
ance can be viewed as a measure of class separability or histogram bimodality. It 
can be efficiently calculated using: 

2 /.^ _ WPojt) - i^jt)? 
"^^'^ - P0it)p.{i) 

It should be noted that the efficiency of Otsu's method can be attributed to the 
fact that it operates on histogrammed pixel gray values, which are non-negative 
integers. Var-Part and PCA-Part, on the other hand, operate on the projections of 
the points on the partitioning axis, which are often fractional. This problem can be 
circumvented by linearly scaling the projection values to the limits of the histogram, 
i.e., and L — 1. Let yi be the projection of a point x^ on the partitioning axis, yi 
can be mapped to histogram bin 6 given by: 

liyi- min yj 

max yj — min yj 

j J 
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where \_z\ is the floor function which returns the largest integer less than or equal 
to z. 

The computational complexities of histogram construction and Otsu's method 
are 0{Ni) {Ni: number of points in the cluster) and 0{L), respectively. L is constant 
in our experiments and therefore the proposed modification does not alter the linear 
time complexity of Var-Part and PCA-Part. 

Figure [2] shows a histogram where using the mean point as a threshold leads to 
poor results. This histogram is constructed during the first iteration of PCA-Part 
from the projections of the points in the Shuttle data set (see Table [T]). As marked 
on the figure, the mean point of this histogram is 61, whereas Otsu's method gives 
a threshold of 105. The SSE of the initial cluster is 1, 836. When the mean point of 
the histogram is used a threshold, the resulting two subclusters have SSE's of 408 
and 809. This means that splitting the initial cluster with a hyperplane orthogonal 
to the principal eigenvector of the cluster covariance matrix at the mean point 
results in approximately 34% reduction in the SSE. On the other hand, when Otsu's 
threshold is used, the subclusters have SSE's of 943 and 101, which translates to 
about 43% reduction in the SSE. In the next section, we will demonstrate that 
using Otsu's threshold instead of the mean point often leads to significantly better 
initial clusterings on a variety of data sets. 
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Fig. 2. Comparison of mean point and Otsu's thresholds 
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3. Experimental Results 

The experiments were perfor med on 24 commonly used data sets from the UCI 
Machine Learning Repository l^. Tabic [T] gives the data set descriptions. For each 
data set, the number of clusters (K) was set equal to the nurnber of classes {K'), 
as commonly seen in the related literature I 23 | 39 | 40 | 29 | 30 | 31 | 32 | 54 | 

In clustering applications, normalization is a common preprocessing step that 
is necessary to prevent attributes with large ranges from dominating the distance 
calculations and also to avoid numerical instabilities in the computations. Two 
commonly used normalization schemes are linear scaling to unit range (min-max 
normalization) and linear scaling to unit variance (z-score normalization). Several 
studies revealed that the former scheme is preferable t o the l atter since the latter 
is likely to eliminate valuable betwccn-cluster variation l^^'^. As a result, wc used 
min-max normalization to map the attributes of each data set to the [0, 1] interval. 



Table 1. Data Set Descriptions (A^: # points, D: # attributes, K': # classes) 



ID 


Data Set 


A^ 


D 


K' 


1 


Abalone 


4,177 


7 


28 


2 


Breast Cancer Wisconsin (Original) 


683 


9 


2 


3 


Breast Tissue 


106 


9 


6 


4 


Ecoli 


336 


7 


8 


5 


Glass Identification 


214 


9 


6 


6 


Heart Disease 


297 


13 


5 


7 


Ionosphere 


351 


34 


2 


8 


Iris (Bezdek) 


150 


4 


3 


9 


ISOLET 


7,797 


617 


26 


10 


Landsat SateUite (Statlog) 


6,435 


36 


6 


11 


Letter Recognition 


20,000 


16 


26 


12 


MAGIC Gamma Telescope 


19,020 


10 


2 


13 


Multiple Features (Fourier) 


2,000 


76 


10 


14 


Musk (Clean2) 


6,598 


166 


2 


15 


Optical Digits 


5,620 


64 


10 


16 


Page Blocks Classification 


5,473 


10 


5 


17 


Pima Indians Diabetes 


768 


8 


2 


18 


Shuttle (Statlog) 


58,000 


9 


7 


19 


Spambase 


4,601 


57 


2 


20 


SPECTF Heart 


267 


44 


2 


21 


Wall-Following Robot Navigation 


5,456 


24 


4 


22 


Wine Quality 


6,497 


11 


7 


23 


Wine 


178 


13 


3 


24 


Yeast 


1,484 


8 


10 
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The performance of the IMs was quantified using two effectiveness (quahty) and 
two efficiency (speed) criteria: 

t> Initial SSE: This is the SSE value calculated after the initialization phase, 
before the clustering phase. It gives us a measure of the effectiveness of an 
IM by itself. 

> Final SSE: This is the SSE value calculated after the clustering phase. It 
gives us a measure of the effectiveness of an IM when its output is refined by 
k-means. Note that this is the objective function of the k-means algorithm, 
i.e., ©. 

> Number of Iterations: This is the number of iterations that k-means 
requires until reaching convergence when initialized by a particular IM. It 
is an efficiency measure independent of programming language, implemen- 
tation style, compiler, and CPU architecture. 

> CPU Time: This is the total CPU time in milliseconds taken by the 
initialization and clustering phases. 

All of the methods were implemented in the C language, compiled with the gcc 
v4.4.3 compiler, and executed on an Intel Xcon E5520 2.26GHz machine. Time 
measurements were performed using the getrusage function, which is capable of 
measuring CPU time to an accuracy of a microsecond. The MT19937 variant of 
the Mers enn e Twister algorithm was used to generate high quality pseudorandom 
numbers '^. 

The convergence of k-means was controlled by the disjunction of two crite- 
ria: the number of iterations reaches a maximum of 100 or the relative improve- 
ment in SSE between two consecutive iterations drops below a threshold, i.e., 
(SSEi_i - SSE,)/SSE, < e, where SSE, denotes the SSE value at the end of the 
i-th {i g {1, 2, . . . , 100}) iteration. The convergence threshold was set to e = 10~^. 

In this study, we focus on IMs that have time complexity linear in N . This is 
because k-means itself has linear complexity, which is perhaps the most important 
reason for its popularity. Therefore, an IM for k-means should not diminish this 
advantage of the algorithm. The proposed methods, named Otsu Var-Part (OV) 
and Otsu PCA-Part (OP), were compared to six popular, linear, order- invariant IMs: 
Forgy's method (F), MacQuccn's second method (M), maximin (X), k-mcans++ (K), 
Var-Part (V), and PCA-Part (P). It should be noted that among these methods F, 
M, and K are random, whereas }0, V, P, DV, and OP are deterministic. 

We first examine the influence of L (number of histogram bins) on the perfor- 
mance of OV and OP. Tables [2] and [3] show the initial and final SSE values obtained 
by respectively OV and OP for L = 64, 128, 256, 512, 1024 on four of the largest data 
sets (the best values are underlined ) . It can be seen that the performances of both 
methods are relatively insensitive to the value of L. Therefore, in the subsequent 

The first center is chosen as the centroid of the data set. 
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experiments we report the results for L = 256. 



Table 2. Influence of L on the effectiveness of Otsu Var-Part (L: # histogram bins) 



ID 


Criterion 


L = 64 


L = 128 


L = 256 


L = 512 


L = 1024 


9 


Initial SSE 


143859 


144651 


144658 


144637 


144638 




Final SSE 


118267 


119127 


118033 


118033 


118034 


10 


Initial SSE 


1987 


1920 


1919 


1920 


1920 




Final SSE 


1742 


1742 


1742 


1742 


1742 


11 


Initial SSE 


3242 


3192 


3231 


3202 


3202 




Final SSE 


2742 


2734 


2734 


2734 


2734 


15 


Initial SSE 


17448 


17504 


17504 


17504 


17504 




Final SSE 


14581 


14581 


14581 


14581 


14581 



Table 3. Influence of L 


on the effectiveness of Otsu PCA-Part (L: # histogram bins) 


ID Criterion 


L = 64 


L= 128 


L = 256 


L = 512 


L = 1024 


9 Initial SSE 


123527 


123095 


122528 


123129 


123342 


Final SSE 


118575 


118577 


119326 


118298 


118616 


10 Initial SSE 


1855 


1807 


1835 


1849 


1848 


Final SSE 


1742 


1742 


1742 


1742 


1742 


11 Initial SSE 


2994 


2997 


2995 


2995 


2991 


Final SSE 


2747 


2747 


2747 


2747 


2747 


15 Initial SSE 


15136 


15117 


15118 


15116 


15117 


Final SSE 


14650 


14650 


14650 


14650 


14650 



In the remaining experiments, each random method was executed a 100 times 
and statistics such as minimum^ mean, and standard deviation were collected for 
each performance criteria. The minimum and mean statistics represent the best and 
average case performance, respectively, while standard deviation quantifies the vari- 
ability of performance across different runs. Note that for a deterministic method, 
the minimum and mean values are always identical and the standard deviation is 
always 0. 

Tables IIHT] give the performance measurements for each method with respect to 
initial SSE, final SSE, number of iterations, and CPU time, respectively. For each 
of the initial SSE, final SSE, and number of iterations criteria, we calculated the 
ratio of the minimum / mean / standard deviation value obtained by each method to 
the best (least) minimum / mean / standard deviation value on each data set. These 
ratios will be henceforth referred to as the 'normalized' performance criteria. For 
example, on the Abalone data set (1), the minimum initial SSE of F is 424.92 
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and the best minimum initial SSE is 21.57 and thus the normaUzed initial SSE is 
about 20. This simply means that on this data set F obtained approximately 20 
times worse m,inim,um, initial SSE than the best method. We then averageqj these 
normalized values over the 24 data sets to quantify the overall performance of each 
method with respect to each statistic (see Table [8]) . Note that we did not attempt 
to summarize the CPU time values in the same manner due to the sparsity of the 
data (see Table [7]). For convenient visualization, Figure [3] shows box plots of the 
normalized performance criteria. Here, the bottom and top end of the whiskers 
of a box represent the minimum and maximum, respectively, whereas the bottom 
and top of the box itself are the 25th percentile (Ql) and 75th percentile {Q3), 
respectively. The line that passes through the box is the 50th percentile {Q2), i.e., 
the median, while the small square inside the box denotes the mean. 

4. Discussion 

4.1. Best Case Performance Analysis 

With respect to the minimum, statistic, the following observations can be made: 

t> Initial SSE: DP is the best method, followed closely by P, OV, and V. On 
the other hand, F is the worst method, followed by X. These two methods 
give 2-3 times worse m,inimum initial SSE than the best method. It can 
be seen that multiple runs of random methods do not produce good initial 
clusterings. In contrast, only a single run of OP, P, DV, or V often gives very 
good results. This is because these methods arc approximate clustering al- 
gorithms by themselves and thus they give reasonable results even without 
k-means refinement. 

> Final SSE: X is the worst method, while the remaining methods exhibit 
very similar performance. This homogeneity in performance is because k- 
means can take two disparate initial configurations to similar (or even iden- 
tical) local minima. Given the abundance of local minima even in data sets 
of moderate size and/or dimensionality and the gradient descent nature of 
k-means, it is not surprising that the deterministic methods (except X) per- 
form slightly worse than the random methods as the former methods were 
executed only once, whereas the latter ones were executed a 100 times. 

[> Number of Iterations: K is the best method, followed by M and F. X is 
the worst method. As in the case of final SSE, random methods outperform 
deterministic methods due to their multiple-run advantage. 

4.2. Average Case Performance Analysis 

With respect to the mean statistic, the following observations can be made: 

■^Duc to outliers, the 'median' statistic rather than the 'mean' was used to summarize the normal- 
ized standard deviation values. 
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t> Initial SSE: OP is the best method, followed closely by P, OV, and V. 
The remaming methods give 1.7-3.2 times worse mean initial SSE than 
any of these hierarchical methods. Random methods exhibit significantly 
worse average performance than the deterministic ones because the former 
methods can produce highly variable results across different runs (see the 
standard deviation values in Table H} . 

[> Final SSE: This is similar to the case of minimum final SSE, with the 
difference that deterministic methods (except X) are now slightly better 
than the random ones. Once again, this is because random methods can 
produce highly variable results due to their stochastic nature. 

> Number of Iterations: The ranking of the methods is similar to the case 
of mean final SSE. 

4.3. Consistency Analysis 

With respect to the standard deviation statistic, F is significantly better than both 
M and K. If, however, the application requires absolute consistency, i.e., exactly the 
same clustering in every run, a deterministic IM should be used. 

4.4. CPU Time Analysis 

It can be seen from Table [7] that, on about half of the data sets, each of the IMs 
require less than a few milliseconds of CPU time. On the other hand, on large 
and/or high-dimensional data sets efficiency differences become more prominent. It 
should be noted that each of the values reported in this table corresponds to a single 
k-means 'run'. In practice, a random method is typically executed R times, e.g., 
in this study R = 100, and the output of the run that gives the least final SSE is 
taken as the result. Therefore, the total computational cost of a random method is 
often significantly higher than that of a deterministic method. For example, on the 
ISOLET data set, which has the greatest N x D x K value among all the data sets, 
K took on the average 3, 397 milliseconds, whereas OP took 12, 460 milliseconds. The 
latter method, however, required about 27 times less CPU time than the former 
one since the former was executed a total of 100 times. 

4.5. Relative Performance Analysis 

We also determined the number of data sets (out of 24) on which OV and OP re- 
spectively performed {worse than/same as/better than} V and P. Tables \9\ and ITOl 
present the results for OV and OP, respectively. It can be seen that, with respect to 
initial SSE and number of iterations criteria, OV outperforms V more often than not. 
On the other hand, OP frequently outperforms P with respect to both criteria. As 
for final SSE, OP performs slightly better than P, whereas OV performs slightly worse 
than V. It appears that Otsu's method benefits P more than it benefits V. This is 
most likely due to the fact that histograms of projections over the major axis nee- 
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essarily have a greater dynaraic range and variability and thus are more amenable 
to thresholding compared to histograms of projections over any coordinate axis. 

4.6. Recommendations for Practitioners 

Based on the analyses presented above, the following recommendations can be 
made: 

i> In general, X should not be used. As mentioned in Section [21 this method 
was not designed specifically as a k- means initializer'^. It is easy to under- 
stand and implement, but is mostly ineffective and unreliable. Furthermore, 
despite its low overhead, this method does not offer significant time savings 
since it often results in slow k-means convergence. 

i> In applications that involve small data sets, e.g., N < 1,000, K should be 
used. It is computationally feasible to run this method hundreds of times 
on such data sets given that one such run takes only a few milliseconds. 

> In time-critical applications that involve large data sets or applications 
that demand determinism, the hierarchical methods should be used. These 
methods need to be executed only once and they lead to reasonably fast 
k-means convergence. The efficiency difference between V/OV and P/OP is 
noticeable only on high dimensional data sets such as ISOLET (D = 617) 
and Musk (D = 166). This is because V/OV calculates the direction of split 
by determining the coordinate axis with the greatest variance (in 0{D) 
time), whereas P/OP achieves this by calculating the principal eigenvector 
of the cluster covariance matrix (in 0{D'^) time using the power method 
l^^. Note that despite its higher computational complexity, P/OP can. in 
some cases, be more efficient than V/OV (sec Table [7]). This is because 
the former converges significantly faster than the latter (see Table [8]). The 
main disadvantage of these methods is that they are more complicated to 
implement due to their hierarchical formulation. 

t> In applications where an approximate clustering of the data set is desired, 
the hierarchical methods should be used. These methods produce very good 
initial clusterings, which makes it possible to use them as standalone clus- 
tering algorithms. 

> Among the hierarchical methods, the ones based on PCA, i.e., P and OP, 
are preferable to those based on variance, i.e., V and OV. Furthermore, 
the proposed OP and OV methods generally outperform their respective 
counterparts, i.e., P and V, especially with respect to initial SSE and number 
of iterations. 

5. Conclusions 

In this paper, we presented a simple modification to Var-Part and PCA-Part, two 
hierarchical k-means initialization methods that are linear, deterministic, and order- 
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invariant. We compared the original methods and their modified versions to some of 
the most popular linear initialization methods, namely Forgy's method, Macqueen's 
second method, maximin, and k-means++, on a large and diverse collection of data 
sets from the UCI Machine Learning Repository. The results demonstrated that, 
despite their deterministic nature, Var-Part and PCA-Part are highly competitive 
with one of the best random initialization methods to date, i.e., k-meansH — h. In 
addition, the proposed modification significantly improves the performance of both 
hierarchical methods. The presented Var-Part and PCA-Part variants can be used 
to initialize k-means effectively, particularly in time-critical applications that in- 
volve large data sets. Alternatively, they can be used as approximate clustering 
algorithms without additional k-means refinement. 
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Table 4. Initial SSE comparison of the initialization methods 



1 1 
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P 


QV 


OP 
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mean 
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483 ± 20 


33 

46 ± 10 


29 
34 ± 2 


95 
95 ± 


24 
24 ± 


23 
23 ± 


23 
23 ± 


22 
22 ± 


2 


mm 


534 
575 ± 15 


318 
706 ± 354 


304 
560 ± 349 


498 
498 ± 


247 
247 ± 


240 
240 ± 


258 
258 ± 


239 
239 ± 


3 


mean 


20 
27 ± 3 


11 
20 ± 8 


9 
13 ± 2 


19 
19 ± 


8 
8 ± 


8 
8 ± 


8 
8 ± 


7 ± 
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mean 


54 
61 ± 2 


26 
40 ± 7 


26 
33 ± 5 


48 
48 ± 


20 
20 ± 


19 
19 ± 


19 
19 ± 


20 
20 ± 
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mean 


42 
48 ± 2 


24 
40 ± 9 


25 
32 ± 5 


45 
45 ± 


21 
21 ± 


20 
20 ± 


21 
21 ± 


IS 
IS ± 
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mean 


372 
396 ± 8 


361 
463 ± 58 


341 
450 ± 49 


409 
409 ± 


249 
249 ± 


250 
250 ± 


249 
249 ± 


244 
244 ± 
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mm 


771 
814 ± 12 


749 
1246 ± 463 


720 
1237 ± 468 


827 
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632 
632 ± 


629 
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629 
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8 
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34 ± 4 


9 
28 ± 23 


9 
16 ± 6 


18 
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8 
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Table 5. Final SSE comparison of the initialization methods 
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36373 
37058 ± 1626 


36373 
36373 ± 


36373 
36373 ± 


36373 
36373 ± 


36373 
36373 ± 


36373 
36373 ± 


15 


mean 


14559 
14687 ± 216 


14559 
14752 ± 236 


14559 
14747 ± 245 


14679 
14679 ± 


14581 
14581 ± 


14807 
14807 ± 


14581 
14581 ± 


14650 
14650 ± 


16 


mm 


215 
217 ± 4 


215 
216 ± 2 


215 
220 ± 7 


230 
230 ± 


227 ± 


215 
215 ± 


229 
229 ± 


216 
216 ± 


17 


mean 


121 
121 ± 


121 
122 ± 5 


121 
122 ± 5 


121 
121 ± 


121 
121 ± 


121 
121 ± 


121 
121 ± 


121 
121 ± 


18 


mm 


235 
317 ± 46 


235 
272 ± 23 


235 
260 ± 31 


726 
726 ± 


235 

235 ± 


274 
274 ± 


274 
274 ± 


235 

235 ± 


19 


mm 


765 
77S ± 3 


765 
779 ± 14 


765 
7S5 ± 19 


765 
765 ± 


778 
778 ± 


778 
778 ± 


778 
778 ± 


765 
765 ± 


20 


mean 


214 
214 ± 


214 
215 ± 5 


214 
214 ± 


214 
214 ± 


214 
214 ± 


214 
214 ± 


214 
214 ± 


214 
214 ± 


21 


mm 


7772 
7799 ± 93 


7772 
7821 ± 124 


7772 
7831 ± 140 


7772 
7772 ± 


7774 
7774 ± 


7774 
7774 ± 


7774 
7774 ± 


7772 
7772 ± 


22 


mm 


334 
335 ± 1 


334 
336 ± 3 


334 
336 ± 3 


399 
399 ± 


335 
335 ± 


334 
334 ± 


335 
335 ± 


335 
335 ± 


23 


mm 


49 
49 ± 


49 
49 ± 2 


49 

49 ± 2 


63 
63 ± 


49 

49 ± 


49 
49 ± 


49 
49 ± 


49 
49 ± 


24 


mm 


58 
64 ± 6 


58 
69 ± 6 


58 
63 ± 5 


61 

61 ± 


69 
69 ± 


59 
59 ± 


69 
69 ± 


59 
59 ± 
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Table 6. Number of iterations comparison of the initialization methods 



1 1 


H 


K 


X 


V 




P 


DV 


DP 




1 


mean 


59 
90 ± 11 


29 
68 ± 19 


22 

48 ± 17 


100 
100 ± 


50 
50 ± 


43 


43 

± 


31 
31 ± 


38 
3S ± 


2 


mean 


4 

5 ± 


4 
6 ± 1 


4 
6 ± 1 


8 
8 ± 


4 

4 ± 


4 


4 
± 


5 

5 ± 


3 

3 ± 


3 


mean 


10 ± 2 


5 
9 ± 3 


3 
7 ± 2 


7 ± 


6 

6 ± 


7 


7 
± 


5 ± 


3 

3 ± 


4 


mean 


8 
15 ± 6 


8 
15 ± 5 


7 
14 ± 5 


14 
14 ± 


17 
17 ± 


7 


7 
± 


12 

12 ± 


6 
6 ± 


5 


mean 


6 
10 ± 3 


11 ± 4 


4 
9 ± 3 


6 

6 ± 


6 ± 


5 


± 


9 

9 ± 


4 
4 ± 


6 


mean 


11 ± 3 


10 ± 3 


5 
9 ± 3 


12 
12 ± 


3 

3 ± 


4 


4 
± 


3 

3 ± 


4 
4 ± 


7 


min 


4 

5 ± 1 


3 

7 ± 2 


3 
8 ± 2 


3 

3 ± 


3 

3 ± 


3 


3 

± 


4 
4 ± 


2 
2 ± 


8 


mean 


4 
9 ± 3 


4 

8 ± 2 


3 
7 ± 3 


6 
6 ± 


4 
4 ± 


4 


4 
± 


6 
6 ± 


3 

3 ± 


9 


mean 


IS 
43 ± 15 


19 

40 ± 14 


14 
36 ± 13 


32 

32 ± 


82 
82 ± 


45 


45 
± 


59 

59 ± 


39 
39 ± 


10 


mean 


12 

28 ± 8 


12 
33 ± 10 


11 
29 ± 9 


53 
53 ± 


28 
28 ± 


27 


27 
± 


10 
10 ± 


23 
23 ± 


11 


mean 


39 
75 ± 19 


37 

72 ± IS 


31 
76 ± 18 


73 

73 ± 


100 
100 ± 


83 


83 
± 


67 
67 ± 


85 
85 ± 


12 


mean 




18 ± 5 


10 
18 ± 5 


10 
20 ± 6 


35 

35 ± 


25 
25 ± 


10 


10 
± 


26 

26 ± 


9 
9 ± 


13 


mean 


13 
29 ± 10 


14 
30 ± 10 


13 
30 ± 11 


37 
37 ± 


14 
14 ± 


25 


25 
± 


17 
17 ± 


13 
13 ± 


14 


mean 


4 
6 ± 1 


4 
6 ± 1 


4 
6 ± 1 


8 
8 ± 


5 ± 


5 


± 


5 ± 


3 

3 ± 


15 


mean 


12 
31 ± 13 


12 
33 ± 14 


14 
30 ± 10 


36 

36 ± 


16 
16 ± 


22 


22 

± 


15 
15 ± 


59 
59 ± 


16 


mean 


14 

27 ± 9 


12 
31 ± 14 


9 
24 ± 11 


27 
27 ± 


25 
25 ± 


15 


15 
± 


19 
19 ± 


16 
16 ± 


17 


mean 


8 
13 ± 2 


4 
12 ± 5 


4 
11 ± 4 


19 
19 ± 


11 
11 ± 


10 


10 
± 


8 
8 ± 


5 ± 


18 


mm 


10 

25 ± 9 


8 
25 ± 11 


9 
23 ± 9 


22 

22 ± 


30 
30 ± 


16 


16 

± 


7 

7 ± 


27 
27 ± 


19 


mean 


6 

12 ± 5 


3 

14 ± 6 


3 
12 ± 7 


5 ± 


9 

9 ± 


10 


10 
± 


12 

12 ± 


3 
3 ± 


20 


mean 


6 

8 ± 1 


8 ± 2 


4 
7 ± 2 


7 
7± 


7 
7 ± 


7 


± 


6 
6 ± 


2 
2 ± 


21 


min 


9 
20 ± 8 


11 
22 ± 8 


11 
20 ± 8 


24 

24 ± 


20 
20 ± 


8 


8 
± 


21 
21 ± 


19 
19 ± 


22 


mean 


15 
41 ± 20 


17 
42 ± IS 


18 
40 ± 18 


20 
20 ± 


62 
62 ± 


50 


50 
± 


33 
33 ± 


49 
49 ± 


23 


min 


4 

7 ± 2 


1 
8 ± 3 


1 
7 ± 3 


9 
9 ± 


5 
5 ± 


7 


7 
± 


5 
5 ± 


5 
5 ± 


24 


mean 


13 
29 ± 10 


13 
31 ± 11 


15 
29 ± 10 


73 

73 ± 


33 
33 ± 


21 


21 
± 


28 
28 ± 


32 
32 ± 
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Table 7. CPU time comparison of the initialization methods 



1 1 


H 


K 


=< 


V 


p 


ov 


DP 




1 


mean 


20 

43 ± 7 


10 
33 ± 10 


10 
24 ± 9 


40 
40 ± 


20 
20 ± 


20 
20 ± 


20 
20 ± 


10 
10 ± 


2 


mm 




Oil 



± 1 




± 2 



± 



± 



± 



± 




± 


3 


mean 



± 1 



± 1 



± 



± 



± 



± 




± 



± 


4 


mean 




± 2 



± 1 




± 2 



± 



± 



± 




± 



± 


5 


mean 




± 1 



± 1 



± 1 



± 



± 



± 



± 



± 


6 


mean 



± 1 




± 2 



1 ± 2 



± 



± 



± 



± 



± 


7 


mm 




± 2 



± 2 




± 2 



± 



± 




± 




± 



± 


8 


mean 



± 



± 1 



± 1 



± 



± 



± 



± 



± 


9 


mean 


1690 
3691 ± 1229 


1630 
3370 ± 1178 


1580 
3397 ± 1055 


2570 
2570 ± 


6920 
6920 ± 


12160 
12160 ± 


5040 
5040 ± 


12460 
12460 ± 


ID 


mm 



30 ± 9 


10 
32 ± 11 


10 
32 ± 10 


50 
50 ± 


30 

30 ± 


50 
50 ± 


10 
10 ± 


50 
50 ± 


11 


mean 


380 
710 ± 174 


350 
673 ± 168 


320 
724 ± 166 


670 
670 ± 


960 
960 ± 


790 
790 ± 


620 
620 ± 


810 
810 ± 


12 


mean 


10 

19 ± 7 


10 

19 ± 7 


10 
21 ± 6 


40 
40 ± 


20 
20 ± 


10 
10 ± 


30 
30 ± 


20 
20 ± 


13 


mean 


20 
52 ± 17 


20 

52 ± 17 


20 
57 ± 20 


60 
60 ± 


20 
20 ± 


SO 
80 ± 


30 
30 ± 


40 
40 ± 


14 


mean 


10 
22 ± 7 


10 
21 ± 5 


10 
25 ± 8 


30 
30 ± 


30 
30 ± 


210 
210 ± 


20 
20 ± 


200 
200 ± 


15 


mean 


50 
122 ± 50 


60 
126 ± 53 


50 
124 ± 41 


140 

140 ± 


60 
60 ± 


140 
140 ± 


70 
70 ± 


280 
280 ± 


16 


mm 




9 ± 5 




11 ± 6 




8 ± 5 


10 
10 ± 


10 

10 ± 


10 
10 ± 


10 
10 ± 


10 
10 ± 


17 


mean 




± 2 



1 ± 2 



1 ± 2 



± 



± 



± 



± 



± 


18 


mm 


40 

87 ± 27 


30 

87 ± 37 


30 

79 ± 26 


50 
50 ± 


100 
100 ± 


70 
70 ± 


30 
30 ± 


100 
100 ± 


19 


mm 




11 ± 5 



12 ± 6 




11 ± 8 


10 
10 it 


10 
10 ± 


30 
30 ± 


10 
10 ± 


20 
20 ± 


20 


mean 




± 2 



± 2 



± 2 



± 



± 



± 



± 




± 


21 


mm 


10 
21 ± 9 


10 
20 ± 8 


10 
20 ± 8 


20 
20 ± 


20 
20 ± 


20 
20 ± 


20 
20 ± 


30 
30 ± 


22 


mm 


10 
40 ± 19 


20 

40 ± 18 


20 
38 ± 17 


10 
10 ± 


60 
60 ± 


50 
50 ± 


40 
40 ± 


40 
40 ± 


23 


mm 



± 1 



± 1 




± 



± 



± 



± 



± 



± 


24 


mm 




6 ± 5 



6 ± 5 



6 ± 5 


10 
10 ± 



± 



± 


10 
10 ± 



± 
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Table 8. Overall performance comparison of the initialization methods 



Statistic 


Criterion 


F 


M 


K 


X 


V 


P 


ov 


OP 


Min 


Init. SSE 


2.968 


1.418 


1.348 


2.184 


1.107 


1.043 


1.067 


1.002 




Final SSE 


1.001 


1.003 


1.000 


1.163 


1.019 


1.018 


1.031 


1.005 




# Iters. 


1.488 


1.284 


1.183 


2.978 


2.469 


2.013 


2.034 


1.793 


Mean 


Init. SSE 


3.250 


2.171 


1.831 


2.184 


1.107 


1.043 


1.067 


1.002 




Final SSE 


1.047 


1.049 


1.032 


1.161 


1.017 


1.016 


1.029 


1.003 




# Iters. 


2.466 


2.528 


2.314 


2.581 


2.057 


1.715 


1.740 


1.481 


Stdcv 


Init. SSE 


1.000 


13.392 


12.093 


- 


- 


- 


- 


- 




Final SSE 


1.013 


1.239 


1.172 


- 


- 


- 


- 


- 




# Iters. 


1.000 


1.166 


1.098 


- 


- 


- 


- 


- 



Table 9. Performance of Otsu 


Var-Part relative to Var-Part (V) 


Criterion 


Worse 


than V 


Same 


as V 


Better than V 


Init. SSE 




6 




3 


15 


Final SSE 




6 




16 


2 


# Iters. 




8 




3 


13 



Table 10. Performance of Otsu PCA-Part relative to PCA-Part (P) 
Criterion Worse than P Same as P Better than P 



Init. SSE 

Final SSE 

# Iters. 



2 

14 

1 



21 
6 

17 
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Fig. 3. Box plots of the normalized performance criteria 



